# Japanese Visual Question Answering

Heron NVILA Lite 1B
Apache-2.0
A Japanese vision-language model built on the NVILA-Lite architecture, supporting image-text interaction in both Japanese and English.
Image-to-Text Safetensors Supports Multiple Languages
turing-motors
460
2
Sarashina2 Vision 14b
MIT
Sarashina2-Vision-14B is a large Japanese vision-language model developed by SB Intuitions. It combines Sarashina2-13B with the image encoder of Qwen2-VL-7B and performs strongly on multiple benchmarks.
Image-to-Text Transformers Supports Multiple Languages
sbintuitions
192
6
Sarashina2 Vision 8b
MIT
Sarashina2-Vision-8B is a large Japanese vision-language model trained by SB Intuitions. It combines Sarashina2-7B with the image encoder of Qwen2-VL-7B and performs strongly on multiple benchmarks.
Image-to-Text Transformers Supports Multiple Languages
sbintuitions
1,233
4
Llm Jp 3 Vila 14b
A large-scale vision-language model developed by Japan's National Institute of Informatics, supporting Japanese and English with strong image understanding and text generation capabilities.
Image-to-Text Japanese
llm-jp
106
10
Convllava JP 1.3b 1280
ConvLLaVA-JP is a Japanese vision-language model that supports high-resolution input and can engage in conversations about input images.
Image-to-Text Transformers Japanese
toshi456
31
1
Llava Calm2 Siglip
Apache-2.0
llava-calm2-siglip is an experimental vision-language model capable of answering questions about images in Japanese and English.
Image-to-Text Transformers Supports Multiple Languages
cyberagent
3,930
25
Chat Vector Llava V1.5 7b Ja
A vision-language model that can hold dialogues in Japanese about input images, created with the Chat Vector method by combining weights from multiple models.
Image-to-Text Transformers Japanese
toshi456
26
1
Llava Jp 1.3b V1.1
LLaVA-JP is a multimodal vision-language model that supports Japanese, capable of understanding and generating descriptions and dialogues about input images.
Image-to-Text Transformers Japanese
toshi456
90
11
Evovlm JP V1 7B
Apache-2.0
EvoVLM-JP-v1-7B is an experimental general-purpose Japanese vision-language model created using an evolutionary model merging method.
Image-to-Text Transformers Japanese
SakanaAI
46
36
Heron Chat Blip Ja Stablelm Base 7b V1 Llava 620k
A vision-language model that can converse about input images in Japanese.
Image-to-Text Transformers Japanese
turing-motors
25
3
Heron Chat Blip Ja Stablelm Base 7b V1
This is a vision-language model capable of engaging in dialogue about input images, supporting Japanese communication.
Image-to-Text Transformers Japanese
turing-motors
40
3
Llava Jp 1.3b V1.0
LLaVA-JP is a Japanese visual language model capable of engaging in dialogue about input images, fine-tuned from llm-jp-1.3b-v1.0 using the LLaVA method.
Image-to-Text Transformers Japanese
toshi456
30
5
Heron Chat Git ELYZA Fast 7b V0
A vision-language model that can hold dialogues about input images in Japanese.
Image-to-Text Transformers Japanese
turing-motors
17
3